Random Forest prediction on the given dataset (random_forest_data.csv) below.

In [1]:
# Pandas is used for data manipulation
import pandas as pd
# Read in data and display first 5 rows
df1 = pd.read_csv('random_forest_data.csv')
df1.head()
Out[1]:
Unnamed: 0 acousticness analysis_url danceability duration_ms energy id instrumentalness key liveness loudness mode speechiness tempo time_signature track_href type uri valence
0 0 0.40000 https://api.spotify.com/v1/audio-analysis/3AEZ... 0.761 222560 0.838 3AEZUABDXNtecAOSC1qTfo 0.000000 4 0.176 -3.073 0 0.0502 93.974 4 https://api.spotify.com/v1/tracks/3AEZUABDXNte... audio_features spotify:track:3AEZUABDXNtecAOSC1qTfo 0.710
1 1 0.18700 https://api.spotify.com/v1/audio-analysis/6mIC... 0.852 195840 0.773 6mICuAdrwEjh6Y6lroV2Kg 0.000030 8 0.159 -2.921 0 0.0776 102.034 4 https://api.spotify.com/v1/tracks/6mICuAdrwEjh... audio_features spotify:track:6mICuAdrwEjh6Y6lroV2Kg 0.907
2 2 0.05590 https://api.spotify.com/v1/audio-analysis/3QwB... 0.832 209453 0.772 3QwBODjSEzelZyVjxPOHdq 0.000486 10 0.440 -5.429 1 0.1000 96.016 4 https://api.spotify.com/v1/tracks/3QwBODjSEzel... audio_features spotify:track:3QwBODjSEzelZyVjxPOHdq 0.704
3 3 0.00431 https://api.spotify.com/v1/audio-analysis/7DM4... 0.663 259196 0.920 7DM4BPaS7uofFul3ywMe46 0.000017 11 0.101 -4.070 0 0.2260 99.935 4 https://api.spotify.com/v1/tracks/7DM4BPaS7uof... audio_features spotify:track:7DM4BPaS7uofFul3ywMe46 0.533
4 4 0.55100 https://api.spotify.com/v1/audio-analysis/6rQS... 0.508 205600 0.687 6rQSrBHf7HlZjtcMZ4S4bO 0.000003 0 0.126 -4.361 1 0.3260 180.044 4 https://api.spotify.com/v1/tracks/6rQSrBHf7HlZ... audio_features spotify:track:6rQSrBHf7HlZjtcMZ4S4bO 0.555
In [2]:
df1.isnull().sum()
Out[2]:
Unnamed: 0          0
acousticness        0
analysis_url        0
danceability        0
duration_ms         0
energy              0
id                  0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
speechiness         0
tempo               0
time_signature      0
track_href          0
type                0
uri                 0
valence             0
dtype: int64
In [3]:
print('The shape of our features is:', df1.shape)
The shape of our features is: (75800, 19)
In [4]:
import numpy as np

# Inspect the unique analysis URLs (the value is discarded inside the cell)
df1['analysis_url'].unique()

# Convert the string-valued columns to int64 codes via pd.factorize
df1["analysis_url"] = pd.factorize(df1["analysis_url"])[0].astype(np.int64)
df1["id"] = pd.factorize(df1["id"])[0].astype(np.int64)
df1["track_href"] = pd.factorize(df1["track_href"])[0].astype(np.int64)
df1["type"] = pd.factorize(df1["type"])[0].astype(np.int64)
df1["uri"] = pd.factorize(df1["uri"])[0].astype(np.int64)
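As a quick sanity check on the encoding above, `pd.factorize` maps each distinct value to an integer code in order of first appearance (a toy example, not part of the dataset):

```python
import pandas as pd

# Each distinct string gets the next unused integer code, in order of first appearance
codes, uniques = pd.factorize(pd.Series(["a", "b", "a", "c"]))
print(codes)          # [0 1 0 2]
print(list(uniques))  # ['a', 'b', 'c']
```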
In [5]:
# Drop the leftover CSV index column; note this returns a new frame and
# does not modify df1 unless the result is assigned back
df1.drop(columns="Unnamed: 0")
Out[5]:
acousticness analysis_url danceability duration_ms energy id instrumentalness key liveness loudness mode speechiness tempo time_signature track_href type uri valence
0 0.40000 0 0.761 222560 0.838 0 0.000000 4 0.1760 -3.073 0 0.0502 93.974 4 0 0 0 0.710
1 0.18700 1 0.852 195840 0.773 1 0.000030 8 0.1590 -2.921 0 0.0776 102.034 4 1 0 1 0.907
2 0.05590 2 0.832 209453 0.772 2 0.000486 10 0.4400 -5.429 1 0.1000 96.016 4 2 0 2 0.704
3 0.00431 3 0.663 259196 0.920 3 0.000017 11 0.1010 -4.070 0 0.2260 99.935 4 3 0 3 0.533
4 0.55100 4 0.508 205600 0.687 4 0.000003 0 0.1260 -4.361 1 0.3260 180.044 4 4 0 4 0.555
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
75795 0.20900 986 0.842 206787 0.777 986 0.000009 6 0.2290 -3.869 0 0.2260 120.081 4 986 0 986 0.751
75796 0.49700 1100 0.917 249001 0.739 1100 0.005040 1 0.0956 -6.005 0 0.1300 132.022 4 1100 0 1100 0.805
75797 0.08800 1105 0.926 198824 0.551 1105 0.000000 2 0.1080 -6.679 1 0.2700 131.972 4 1105 0 1105 0.528
75798 0.06410 1049 0.596 206394 0.897 1049 0.000052 10 0.0628 -2.940 1 0.0462 118.000 4 1049 0 1049 0.578
75799 0.08520 1098 0.771 196653 0.816 1098 0.000000 8 0.1510 -4.028 0 0.1250 120.111 4 1098 0 1098 0.728

75800 rows × 18 columns

In [6]:
y = df1.valence
In [47]:
# One-hot encoding with pandas get_dummies was tried but left commented out:
# the high-cardinality string columns (id, uri, track_href) explode the frame
# to 4607 columns, as the stale output below shows
#df1 = pd.get_dummies(df1)
# Display
#df1.head()
Out[47]:
Unnamed: 0 acousticness danceability duration_ms energy instrumentalness key liveness loudness mode ... uri_spotify:track:7xHWNBFm6ObGEQPaUxHuKO uri_spotify:track:7xmp7f74I0rxUOPjVuIOE8 uri_spotify:track:7xyyjOyiYVJCT3CmJl7HwW uri_spotify:track:7y6c07pgjZvtHI9kuMVqk1 uri_spotify:track:7yHEDfrJNd0zWOfXwydNH0 uri_spotify:track:7yjTvUlvx7S1pneUODlGBg uri_spotify:track:7ynCQo1KpBOyTdTdAnjSLZ uri_spotify:track:7yyRTcZmCiyzzJlNzGC9Ol uri_spotify:track:7zgqtptZvhf8GEmdsM2vp2 uri_spotify:track:7zkQwd9ZjsqvGexq5oQ4m6
0 0 0.40000 0.761 222560 0.838 0.000000 4 0.176 -3.073 0 ... 0 0 0 0 0 0 0 0 0 0
1 1 0.18700 0.852 195840 0.773 0.000030 8 0.159 -2.921 0 ... 0 0 0 0 0 0 0 0 0 0
2 2 0.05590 0.832 209453 0.772 0.000486 10 0.440 -5.429 1 ... 0 0 0 0 0 0 0 0 0 0
3 3 0.00431 0.663 259196 0.920 0.000017 11 0.101 -4.070 0 ... 0 0 0 0 0 0 0 0 0 0
4 4 0.55100 0.508 205600 0.687 0.000003 0 0.126 -4.361 1 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 4607 columns

In [22]:
# Split the dataset into features and the target variable

# The id-like string columns (analysis_url, id, track_href, uri) were
# factorized earlier because every value is unique

# numpy is used later for the RMSE computation
import numpy as np
# Labels are the values we want to predict
y = df1['valence']

features = ["acousticness", "analysis_url", "danceability", "duration_ms", "energy",
            "id", "instrumentalness", "key", "liveness", "loudness", "mode",
            "speechiness", "tempo", "time_signature", "track_href", "type", "uri"]
target = ["valence"]
# Remove the labels from the features
# axis 1 refers to the columns
X= df1[features]
#X = df1.drop(columns="Unnamed: 0")

print(X)
print(y)
       acousticness  analysis_url  danceability  duration_ms  energy    id  \
0           0.40000             0         0.761       222560   0.838     0   
1           0.18700             1         0.852       195840   0.773     1   
2           0.05590             2         0.832       209453   0.772     2   
3           0.00431             3         0.663       259196   0.920     3   
4           0.55100             4         0.508       205600   0.687     4   
...             ...           ...           ...          ...     ...   ...   
75795       0.20900           986         0.842       206787   0.777   986   
75796       0.49700          1100         0.917       249001   0.739  1100   
75797       0.08800          1105         0.926       198824   0.551  1105   
75798       0.06410          1049         0.596       206394   0.897  1049   
75799       0.08520          1098         0.771       196653   0.816  1098   

       instrumentalness  key  liveness  loudness  mode  speechiness    tempo  \
0              0.000000    4    0.1760    -3.073     0       0.0502   93.974   
1              0.000030    8    0.1590    -2.921     0       0.0776  102.034   
2              0.000486   10    0.4400    -5.429     1       0.1000   96.016   
3              0.000017   11    0.1010    -4.070     0       0.2260   99.935   
4              0.000003    0    0.1260    -4.361     1       0.3260  180.044   
...                 ...  ...       ...       ...   ...          ...      ...   
75795          0.000009    6    0.2290    -3.869     0       0.2260  120.081   
75796          0.005040    1    0.0956    -6.005     0       0.1300  132.022   
75797          0.000000    2    0.1080    -6.679     1       0.2700  131.972   
75798          0.000052   10    0.0628    -2.940     1       0.0462  118.000   
75799          0.000000    8    0.1510    -4.028     0       0.1250  120.111   

       time_signature  track_href  type   uri  
0                   4           0     0     0  
1                   4           1     0     1  
2                   4           2     0     2  
3                   4           3     0     3  
4                   4           4     0     4  
...               ...         ...   ...   ...  
75795               4         986     0   986  
75796               4        1100     0  1100  
75797               4        1105     0  1105  
75798               4        1049     0  1049  
75799               4        1098     0  1098  

[75800 rows x 17 columns]
0        0.710
1        0.907
2        0.704
3        0.533
4        0.555
         ...  
75795    0.751
75796    0.805
75797    0.528
75798    0.578
75799    0.728
Name: valence, Length: 75800, dtype: float64

Apply cross-validation on the given data and show the accuracy for each split.

In [23]:
from sklearn.model_selection import train_test_split # Import train_test_split function

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test
In [24]:
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
# Instantiate the model with 20 decision trees
rf = RandomForestRegressor(n_estimators=20, random_state=42)
# Train the model on training data
rf.fit(X_train, y_train)
Out[24]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=20,
                      n_jobs=None, oob_score=False, random_state=42, verbose=0,
                      warm_start=False)
In [25]:
rf.score(X_test, y_test)
Out[25]:
0.9968985837589573
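For a regressor, `.score` returns the coefficient of determination R², i.e. 1 − SS_res/SS_tot. A small sketch with made-up numbers (not from this dataset) confirms the equivalence with sklearn's `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 0.5, 2.0, 7.0])   # hypothetical targets
y_hat = np.array([2.5, 0.0, 2.0, 8.0])    # hypothetical predictions

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2, r2_score(y_true, y_hat))  # both ≈ 0.9353
```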
In [26]:
y_pred = rf.predict(X_test)
print(y_pred)

# Compare test-set R^2 for forests of 1 to 19 trees
for i in range(1,20):
    model = RandomForestRegressor(n_estimators=i)
    model.fit(X_train, y_train)
    print("Model score for no of trees",i," is : ",model.score(X_test, y_test))
[0.364 0.922 0.307 ... 0.465 0.847 0.41 ]
Model score for no of trees 1  is :  0.9925385238135918
Model score for no of trees 2  is :  0.9942336607519262
Model score for no of trees 3  is :  0.9959155799137585
Model score for no of trees 4  is :  0.9957290563954198
Model score for no of trees 5  is :  0.9965559124047491
Model score for no of trees 6  is :  0.9958963211607081
Model score for no of trees 7  is :  0.99635561605756
Model score for no of trees 8  is :  0.9963030137824409
Model score for no of trees 9  is :  0.9969232580022542
Model score for no of trees 10  is :  0.9967700289409174
Model score for no of trees 11  is :  0.996688493524748
Model score for no of trees 12  is :  0.9968835014793475
Model score for no of trees 13  is :  0.9966865730820158
Model score for no of trees 14  is :  0.9969814880306654
Model score for no of trees 15  is :  0.9969940114027337
Model score for no of trees 16  is :  0.9964764893933888
Model score for no of trees 17  is :  0.9971321363767855
Model score for no of trees 18  is :  0.9971293285908212
Model score for no of trees 19  is :  0.996767304810825
In [27]:
from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Mean Absolute Error: 0.0008352144923709342
Mean Squared Error: 0.00014699363321941524
Root Mean Squared Error: 0.012124093088533065
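The heading above asks for per-split cross-validation accuracies, which `cross_val_score` with a `KFold` splitter reports directly. A minimal sketch on synthetic stand-in data (the real call would pass the `X` and `y` built earlier):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for (X, y); swap in the real features/labels to reproduce
rng = np.random.default_rng(42)
X_demo = rng.random((300, 5))
y_demo = X_demo @ np.array([0.3, -0.2, 0.5, 0.1, 0.4])

rf_cv = RandomForestRegressor(n_estimators=20, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(rf_cv, X_demo, y_demo, cv=cv)  # one R^2 per fold
for fold, s in enumerate(scores, start=1):
    print(f"Fold {fold} R^2: {s:.4f}")
print("Mean R^2:", scores.mean())
```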

Random Forest along with suitable Plots at the end

In [28]:
from sklearn.tree import export_graphviz

features = ["acousticness", "analysis_url", "danceability", "duration_ms", "energy",
            "id", "instrumentalness", "key", "liveness", "loudness", "mode",
            "speechiness", "tempo", "time_signature", "track_href", "type", "uri"]
target = ["valence"]
estimator = rf.estimators_[5]
export_graphviz(estimator, out_file='tree.dot', 
                feature_names = features,
                class_names = target,
                rounded = True, proportion = False, 
                precision = 2, filled = True)


from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])


from IPython.display import Image
Image(filename = 'tree.png')
Out[28]:
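Graphviz is an external dependency; an alternative "suitable plot" that needs only matplotlib is a bar chart of `rf.feature_importances_`. A sketch on synthetic stand-in data (with the fitted `rf` and `features` list from the cells above, drop the stand-in fit):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the figure is saved to a file
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

# Stand-in model; reuse the fitted `rf` and the real `features` list instead
feat_names = ["f0", "f1", "f2", "f3"]
rng = np.random.default_rng(0)
X_demo = rng.random((200, 4))
y_demo = 2 * X_demo[:, 0] + X_demo[:, 2]
model = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

importances = model.feature_importances_  # nonnegative, sums to 1.0
order = np.argsort(importances)[::-1]     # most important feature first
plt.bar([feat_names[i] for i in order], importances[order])
plt.ylabel("importance")
plt.title("Random forest feature importances")
plt.savefig("importances.png", dpi=150)
```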
Random sampling of data points, combined with random sampling of a subset of the features at each node of the tree, is why the model is called a random forest.

Furthermore, notice that a single tree typically relies on only a handful of the available features to make its splits; according to that tree, the remaining features contribute little to the prediction. Visualizing the tree increases our domain knowledge of the problem, and it tells us which features to examine more closely when we are asked to make a prediction.